This historical overview also briefly summarizes the essential problems and tasks of
databases. Ideally, each sequence is inspected by hand, analyzed with various bioinformatics
programs, and then accurately labeled. This is a lot of work, typically referred to as
database maintenance. Since data sets in bioinformatics usually grow very quickly, database
maintenance is a chronic problem, often exacerbated by the fact that new databases are
usually created within a new project and then no longer maintained once the PhD thesis or
postdoctoral project ends. Only a few large institutions, which are mentioned here and
elsewhere in the book, have enough staff to keep their data well maintained nevertheless,
in particular the NCBI, the EBI and the SIB (Swiss Institute of Bioinformatics).

Other problems of databases are cross-linking to other data (which is also difficult because
the data grow constantly), maintenance of content (especially when new types of content
are added), and the number of erroneous or outdated entries.

For the protein databases UniProt and PDB (the latter is one of the oldest bioinformatics
databases, established in 1971), as for many other databases, the uniform formatting
of entries is a problem. And of course BLAST is not the only program that has difficulty
finding entries quickly and accurately in constantly growing databases. There are two
problems here: recall (sensitivity; what fraction of the truly matching entries stored in
the database does my search actually find?) and precision (often loosely called
specificity; do I find exactly what I am looking for, or does my program suspect that
half the database could be a match?).
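
The two measures can be made concrete with a small calculation. The following is only a minimal sketch in Python: the function name `recall_and_precision` and the entry identifiers are invented for illustration and do not come from BLAST or any real database. It assumes we already know which database entries are true relatives of the query, which in practice is exactly the hard part.

```python
def recall_and_precision(reported_hits, true_relatives):
    """Return (recall, precision) for a single database search.

    reported_hits:  identifiers the search program returned
    true_relatives: identifiers that really match the query (assumed known)
    """
    reported = set(reported_hits)
    relevant = set(true_relatives)
    true_positives = reported & relevant  # hits that are also real matches

    # Recall: share of the real matches that the search found.
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    # Precision: share of the reported hits that are real matches.
    precision = len(true_positives) / len(reported) if reported else 0.0
    return recall, precision


# Hypothetical example: the database holds 4 true relatives of the query,
# and the search reports 5 hits, 3 of which are correct.
true_relatives = {"P1", "P2", "P3", "P4"}
reported_hits = ["P1", "P2", "P3", "X7", "X9"]

recall, precision = recall_and_precision(reported_hits, true_relatives)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# prints: recall = 0.75, precision = 0.60
```

A program that simply reported half the database would score a high recall but a very low precision, which is why both values have to be considered together.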

6  Extremely Fast Sequence Comparisons Identify All the Molecules That Are Present…